
Support Whisper training with Google Cloud buckets#70

Merged
huwenjie333 merged 18 commits into main from whisper_gcp
Apr 13, 2026
Conversation

@huwenjie333
Contributor

@huwenjie333 huwenjie333 commented Mar 6, 2026

This PR makes the following changes to the salt library to support the latest Whisper finetuning:

  • during dataset loading:
    • add support for Google Cloud buckets.
    • add support for limiting the number of examples per dataset.
    • skip the complicated source-target matching logic for ASR models via the skip_matching_asr argument.
    • iterate over validation datasets with multiple threads as well, while keeping the same order.
  • update SALT_LANGUAGE_TOKENS_WHISPER in constants.py with 51 African languages for training the new whisper-salt ASR model.
  • optimize the speed of multilingual_eval_fn in metrics.py by skipping an unnecessary, CPU-heavy audio decoding step.
  • fix a bug in the augment_audio_noise function in preprocessing.py that caused the output audio to be zero-sized.

The Whisper finetuning/training scripts and configs have been moved to the sunbird-speech repo, under the speech-to-text/whisper directory.
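The order-preserving multithreaded iteration over validation datasets can be sketched with the standard library. This is a minimal illustration, not the actual salt code; the dataset names and the `load_examples` helper are hypothetical.

```python
# Minimal sketch of order-preserving multithreaded iteration.
# ThreadPoolExecutor.map runs the loads concurrently but yields
# results in input order, which keeps the dataset order stable.
from concurrent.futures import ThreadPoolExecutor

def load_examples(dataset_id):
    # Stand-in for per-dataset loading work (I/O bound in practice).
    return [f"{dataset_id}-ex{i}" for i in range(3)]

dataset_ids = ["val_a", "val_b", "val_c"]
with ThreadPoolExecutor(max_workers=4) as pool:
    per_dataset = list(pool.map(load_examples, dataset_ids))

# Flatten while preserving the original dataset order.
examples = [ex for batch in per_dataset for ex in batch]
```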


[Deprecated]

This PR adds support for Google Cloud buckets to the Whisper training pipeline, along with several other changes:

  • load the Parquet datasets from a gcs:// path with datasets.load_dataset and cast the audio column to the datasets.Audio format.
  • create a setup shell script that installs the dependencies and configures the Google Cloud credentials.
  • load modules such as salt.datasets from the current repo instead of https://github.com/jqug/salt.git
  • move the YAML config from the training notebook to a separate file
  • fix several training errors with the following changes:
    • use BF16 instead of FP16 for training
    • set gradient_checkpointing=False
    • add torch_dtype=torch.float32 when loading the model weights
    • update model.generation_config based on the requirements of the new version.
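In YAML form, the training fixes above might look like the fragment below. The key names are typical Hugging Face TrainingArguments names and are illustrative only, since the actual config file lives in the sunbird-speech repo.

```yaml
# Hypothetical config fragment (key names assumed, not copied
# from the actual file in sunbird-speech).
training_args:
  bf16: true                    # BF16 instead of FP16
  fp16: false
  gradient_checkpointing: false
model:
  torch_dtype: float32          # dtype used when loading the weights
```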

Overfit experiment

An overfit experiment with just 100 examples was run to verify the changes:

  • MLflow run 1 with evaluation metrics: https://mlflow-sunbird-ce0ecfc14244.herokuapp.com/#/experiments/0/runs/2d488acdc39146e9af9da07c00128d49/model-metrics
  • MLflow run 2 with GPU utilization: https://mlflow.sunbird.ai/#/experiments/0/runs/811bbdf051f44597bd90c3376cfc9309/system-metrics


TODO

  • we need to update the salt.constants.SALT_LANGUAGE_TOKENS_WHISPER to support new languages. Currently we only have the following:
```python
SALT_LANGUAGE_TOKENS_WHISPER = {
    # Exact/close mapping
    'eng': 50259,
    'swa': 50318,
    # Overwrite unused language tokens
    'ach': 50357,
    'lgg': 50356,
    'lug': 50355,
    'nyn': 50354,
    'teo': 50353,
    'xog': 50352,
    'ttj': 50351,
    'kin': 50350,
    'myx': 50349,
    'kik': 50348,
}
```
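As a sanity check on the "exact/close mapping" entries (a sketch, not part of the PR): in the multilingual Whisper tokenizer the language tokens are contiguous, starting at `<|en|>` = 50259 and following the order of Whisper's language table, in which Swahili ("sw") sits at index 59.

```python
# Language tokens in multilingual Whisper are contiguous, starting at
# <|en|> = 50259 and following the tokenizer's language order.
EN_TOKEN_ID = 50259
SWAHILI_INDEX = 59  # position of "sw" in whisper.tokenizer.LANGUAGES

swa_token_id = EN_TOKEN_ID + SWAHILI_INDEX  # should match 'swa' above
```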
  • currently each evaluation step takes 3–4 minutes; I'm not sure whether this is expected.

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



@huwenjie333 huwenjie333 changed the title [WIP] Run Whisper training with Google Cloud buckets Run Whisper training with Google Cloud buckets Mar 6, 2026
@huwenjie333 huwenjie333 requested review from ak3ra, evie-8 and jqug March 6, 2026 12:02
Contributor

@jqug jqug left a comment


Thanks for this, looks good.
Just one thing, let's take out the gcloud auth for now and maybe mention in a comment in the file that this may be necessary.

Comment thread: whisper_training_setup.sh (outdated)
@ak3ra
Contributor

ak3ra commented Mar 9, 2026

We should consider merging this notebook into the dedicated sunbird-speech repo. I have been working on refactoring it here: https://github.com/SunbirdAI/sunbird-speech

Comment thread: notebooks/training/configs/whisper_finetuning_gcs.yaml (outdated)
@huwenjie333 huwenjie333 changed the title Run Whisper training with Google Cloud buckets Support Whisper training with Google Cloud buckets Apr 6, 2026
@huwenjie333 huwenjie333 requested a review from jqug April 8, 2026 03:23
Contributor

@jqug jqug left a comment


Thanks, LGTM

I double checked the language token IDs comparing with the Whisper tokenizer, and they look right. Actually I didn't realise that Whisper supports so many African languages already :)

Comment thread: dataset.py

```python
            "target": row.get("text"),
            "target.language": row.get("language"),
        }
        yield example
```
Contributor


This looks good. A further improvement for later, in case it's an ASR/audio dataset and the format already matches, is not to use a generator at all - we just load the huggingface datasets and concatenate them. That should reduce CPU bottleneck and could improve GPU utilisation.

@huwenjie333 huwenjie333 merged commit d3d1ead into main Apr 13, 2026
1 check passed
